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Abstract 

The  detection  of  groups  of  parallel  lines  is  important  in  applications  such  as  form  pro¬ 
cessing  and  text  (handwriting)  extraction  in  rule  lined  paper.  These  tasks  can  be  very 
challenging  in  degraded  documents  where  the  lines  are  severely  broken.  In  this  paper  we 
propose  a  novel  model-based  method  which  incorporates  high  level  context  to  detect  these 
lines.  After  preprocessing  and  skew  correction,  we  use  trained  Hidden  Markov  Models 
(HMM)  to  locate  the  optimal  positions  of  all  lines  simultaneously,  based  on  the  Viterbi 
decoding.  The  algorithm  is  trainable,  therefore,  it  can  easily  be  adapted  to  different  ap¬ 
plication  scenarios.  The  experiments  conducted  on  known  form  processing  and  rule  line 
detection  show  our  method  is  robust,  and  achieved  better  results  than  other  widely  used 
line  detection  methods,  such  as  the  Hough  transform,  projection  or  vectorization-based 
methods. 

Keywords:  Line  Detection,  Form  Processing,  Hidden  Markov  Model,  Document  Image 
Analysis 
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Figure  1;  Line  detection  results  for  (a)  a  bank  deposit  form  and  (d)  a  rule  lined  document; 
(b)  and  (e)  are  line  detection  results  using  the  DSCC  method;  (c)  and  (f)  are  our  model 
based  line  detection  results. 


1  Introduction 

The  detection  of  groups  of  parallel  lines  is  of  particular  interest  in  some  applications.  Fig. 
la  shows  an  example  of  a  form  document  with  a  group  of  horizontal  and  vertical  parallel 
lines  respectively.  These  lines  provide  crucial  information  for  model-based  form  analysis 
and  processing,  such  as  form  identihcation,  form  registration,  content  understanding,  etc. 
Another  example  is  a  rule  lined  document  shown  in  Fig.  Id.  These  lines  are  originally 
printed  on  the  paper  to  guide  users’  writing.  After  digitization  they  will,  however,  touch 
text  and  cause  problems  for  further  document  analysis  such  as  segmentation  and  recog¬ 
nition.  It  is  important  that  those  lines  can  be  detected  and  removed  before  we  feed  the 
text  to  an  Optical  Character  Recognition  (OCR)  engine. 

Many  line  detection  algorithms  have  been  presented  in  the  literature  [2,5,13,21,33,35]. 
They  can  work  well  on  relatively  clean  documents  with  solid  or  mildly  broken  lines,  but 
the  performance  will  be  significantly  deteriorated  if  lines  are  severely  broken  due  to  the 
low  image  quality,  or  mix,  touch,  and/or  overlap  with  text.  Figs,  lb  and  le  show  the 
DSCC  (Directional  Singly  Connected  Chain)  based  line  detection  results  for  the  form 
and  rule  lined  documents.  We  can  see  only  a  few  lines  are  partially  detected  in  both 
cases  due  to  severe  brokenness,  and  a  long  horizontal  line  is  falsely  detected  on  the  form 
document  (Fig.  lb)  since  several  horizontal  strokes  of  Chinese  characters  happen  to  be 
close  enough  and  lie  on  the  same  line.  It  is  very  difficult,  if  not  impossible,  to  reliably 
detect  these  lines  individually. 

To  handle  these  problems  the  context  is  often  required  to  refine  the  initial  detection. 
For  example,  in  form  processing  most  form  cells  are  rectangular;  in  known  form  processing 
the  number  of  lines  and  the  gaps  among  these  lines  can  be  used  as  a  priori  knowledge 
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and  stored  as  references  in  form  templates.  These  ideas  have  been  presented  in  previous 
work  to  improve  detection  accuracy  and  reduce  false  alarms  [1,9,30,35].  But  the  usage  of 
the  priori  knowledge  in  the  above  applications  is  in  an  ad  hoc  way  and  lacks  a  systematic 
representation. 

In  this  paper  we  propose  a  model-based  method  which  incorporates  context  to  op¬ 
timally  detect  parallel  lines  systematically.  Under  the  model,  lines  are  detected  by  a 
Hidden  Markov  Model  (HMM)  decoding  process  which  can  determine  the  positions  of  all 
lines  simultaneously.  Rather  than  detecting  lines  directly  on  original  images  [2,21,35], 
we  use  a  DSCC-based  scheme  to  hlter  text  as  a  preprocessing  step  so  the  interference 
with  text  can  be  minimized.  We  then  use  a  coarse-to-hne  approach  to  estimate  the  skew 
angle  of  the  document.  After  deskewing  the  document  we  perform  horizontal  and/or  ver¬ 
tical  projections.  Rather  than  simply  treating  the  peaks  of  the  projection  prohle  as  the 
line  positions  [2,21],  we  model  the  projection  prohle  with  a  HMM  model  so  the  context 
among  these  lines  can  be  incorporated.  The  Viterbi  algorithm  is  then  used  to  search  the 
optimal  positions  of  these  lines  simultaneously  from  the  projection  prohle.  The  model 
is  trainable,  therefore,  it  can  be  easily  adapted  for  different  applications.  Two  different 
detection  scenarios  (known  form  processing  and  rule  line  detection)  presented  in  the  pa¬ 
per  illustrate  the  strength  and  advantage  of  our  method.  The  experimental  results  show 
our  method  is  robust,  and  can  detect  lines  with  high  accuracy  and  low  false  alarms  in 
degraded  documents. 

The  remainder  of  this  paper  is  organized  as  follows.  In  the  next  section  we  briehy 
survey  the  literature  of  line  detection  and  form  processing,  followed  by  preprocessing  in 
Section  3.  Section  4  presents  our  HMM-based  line  detection  algorithm  in  detail.  Two 
application  scenarios,  known  form  processing  and  rule  line  detection,  are  described  in 
Section  5.  We  demonstrate  the  robustness  of  our  method  with  quantitative  evaluation 
experiments  in  Section  6.  The  paper  concludes  with  a  brief  summary  and  a  discussion 
of  the  future  work. 

2  Related  Work 
2.1  Line  Detection 

Line  detection  is  widely  used  in  table  detection  and  interpretation  [2,33,35],  engineering 
graph  interpretation  [4],  and  bank  check/invoice  processing  [1,30].  The  algorithms  pre¬ 
sented  can  be  broadly  divided  into  two  types:  Hough  transform  based  and  vectorization 
based  [24].  The  Hough  transform  converts  the  global  pattern  detection  problem  in  an 
image  space  to  a  local  pattern  (ideally  a  point)  detection  problem  in  a  transformed  pa¬ 
rameter  space  [11, 13].  To  detect  a  straight  line,  each  black  pixel  {x,  y)  in  an  image  space 
is  transformed  into  a  sinusoidal  curve  in  the  Hough  parameter  space 

p  =  xcosO  +  ysmO  (1) 

After  transformation,  collinear  points  (xi,  yt)  in  the  image  space  intersect  at  a  point  (p,  9) 
in  the  Hough  parameter  space.  Therefore,  a  peak  in  the  transformed  space  provides  strong 
evidence  that  a  corresponding  straight  line  exists  in  the  image.  The  Hough  transform 
can  detect  dashed  and  mildly  broken  lines.  However,  it  is  very  time  consuming.  To 
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reduce  the  computational  cost,  a  projection  based  method  is  proposed  in  [21]  to  detect 
form  frame  lines  by  limiting  the  search  orientations  since  usually  only  horizontal  and/or 
vertical  lines  exist  in  form  docnments.  The  method  deskews  the  document  hrst,  and 
then  detects  the  peaks  of  the  horizontal  and  vertical  projection  prohles  as  lines.  It  can 
be  viewed  as  a  special  case  of  the  Hough  transform  by  searching  9  only  around  0°  and 
90°.  The  method  will  fail  if  the  projection  of  a  line  does  not  form  a  peak  on  the  prohle 
when  it  mixes  with  text,  or  the  estimated  skew  angle  is  not  accnrate  enongh,  or  the  lines 
are  too  short  or  severely  broken.  Chen  and  Lee  proposed  the  strip  projection  method  to 
alleviate  this  problem  since  lines  are  more  likely  to  form  peaks  on  the  projection  prohle 
in  a  small  region  [2].  For  horizontal  line  detection,  they  hrst  divide  an  image  into  several 
vertical  strips  with  equal  width,  and  then  perform  horizontal  projection  in  each  strip. 
The  detected  co-linear  line  segments  in  each  strip  are  linked  to  form  a  whole  line. 

Vectorization  based  line  detection  algorithms  extract  vectors  from  an  image  hrst,  and 
then  merge  vectors  into  lines  [3,24].  A  vector  can  be  dehned  as  a  collection  of  pixels 
which  satisfy  the  pre-dehned  criterion.  It  can  be  a  chain  of  pixels,  a  line  segment,  or  an 
arc.  Using  vectors  instead  of  pixels  can  signihcantly  reduce  the  number  of  elements  to  be 
processed,  typically  from  100:1  to  10:1.  The  major  difference  among  existing  vectorization 
based  algorithms  is  the  dehnition  and  extraction  of  vectors,  while  the  merging  process  is 
often  rule-based  and  very  similar  for  different  algorithms. 

Thinning  is  a  common  method  to  extract  vectors.  It  uses  an  iterative  boundary  ero¬ 
sion  process  to  remove  outer  pixels  until  only  a  skeleton  of  pixel  chains  remains  [29].  It 
can  maintain  connectivity,  bnt  also  tends  to  create  noisy  jnnctions  at  corners,  intersec¬ 
tions,  and  branches.  Medial  line  methods,  on  the  other  hand,  extract  image  contonrs 
as  pixel  chains  hrst.  Then  the  middle  points  of  the  perpendicnlar  lines,  which  project 
from  one  side  of  the  contour  to  the  other,  form  a  medial  line  [14].  The  methods  may 
miss  pairs  of  contour  lines  at  branches,  so  postprocessing  is  often  required  to  reduce  this 
distortion  [10].  The  result  of  either  thinning  or  medial  line  methods  is  a  chain  of  pixels, 
and  a  line  can  be  detected  by  approximating  the  pixel  chain.  Recently,  the  Sparse  Pixel 
Vectorization  (SPV)  algorithm,  proposed  by  Dori  et  al.  [5],  does  not  use  contours  to  get 
medial  lines.  It  traces  the  medial  axis  points  of  consecutive  horizontal  or  vertical  pixel 
runs  until  some  constraints  are  violated.  Each  continuous  trace  is  a  vector,  representing 
a  bar  or  an  arc.  SPV  often  achieves  better  vectorization  results  than  other  medial  line 
methods,  but  the  vector  extraction  is  complicate,  and  often  needs  postprocessing  to  refine 
the  resnlts. 

As  a  run-length  based  approach,  Block  Adjacency  Graph  (BAG)  is  a  generic  structure 
to  represent  an  image  [33,34].  BAG  is  defined  as  Q{N,E),  where  is  a  set  of  block 
nodes  and  U  is  a  set  of  edges  indicating  the  connection  between  two  nodes.  Each  node  is 
a  block  which  contains  either  one  or  several  horizontal  run  lengths  adjacently  connected 
in  vertical  direction  and  aligned  on  both  left  and  right  sides  within  a  given  tolerance.  A 
Line  is  detected  by  searching  a  connected  sub- graph  in  the  BAG  with  large  longitude.  As 
a  generic  image  presentation  method,  a  BAG  does  not  consider  the  special  characteristics 
of  a  straight  line.  Therefore,  it  can  only  detect  unbroken  lines  within  a  small  skew  range 
(<  V). 

Recently  Zheng  et  al.  presented  a  vectorization  based  algorithm  called  Directional 
Singly  Connected  Chain  (DSCC)  method  [35].  A  DSCC  is  a  chain  of  rnn-lengths  which 
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are  singly  connected.  Each  DSCC  represents  a  line  segment  and  multiple  non-overlapped 
DSCCs  are  merged  into  a  line,  based  on  pre-defined  rules.  The  basic  characteristic  of 
a  line  is  that  it  has  only  one  running  direction.  When  a  junction  is  encountered,  the 
merging  process  stops  and  a  new  DSCC  is  generated. 

2.2  Model-Based  Form  Processing 

Millions  of  form  documents,  such  as  health  insurance  forms,  checks,  and  bank  slips,  are 
being  processed  everyday.  Some  systems  have  been  developed  to  process  these  forms 
automatically  or  semi-automatically  [1,8,30].  Form  processing  can  be  categorized  as 
unknown  and  known  form  processing  [1].  Unknown  form  processing  assumes  no  a  priori 
knowledge  from  the  input  forms,  and  extracts  all  information  based  on  low  level  image 
analysis.  Errors  are  often  expected  and  user  assistance  is  required.  Known  form  process¬ 
ing,  on  the  other  hand,  is  designed  to  process  a  pre-defined  set  of  forms,  where  a  priori 
information  can  be  stored  as  templates  in  the  database  to  guide  the  later  processing.  It 
is  widely  used  in  banks,  post  offices,  and  tax  offices  where  the  types  of  forms  can  be  pre¬ 
defined.  For  an  input  form,  the  system  first  selects  the  template  which  matches  it  best 
{form  identification).  Then  some  anchors  (such  as  specific  marks,  form  frame  lines,  etc.) 
are  detected  for  the  registration  so  the  variations  produced  by  scanning  (e.g.  rotation, 
translation,  and  scaling)  can  be  compensated.  Finally  the  identified  template  is  used  to 
guide  the  system  to  recognize  interested  fields  on  the  form  (different  OCR  engines  can 
be  used  for  different  fields),  and  output  the  recognition  results  to  a  database.  Though 
special  anchors  may  be  available  to  facilitate  the  form  identification  and  registration  for 
specially  designed  forms,  more  general  approaches  use  features  related  to  frame  lines  ex¬ 
plicitly  or  implicitly,  such  as  frame  lines  [1,30-32],  form  cells  [22],  and  the  cross  points  of 
frame  lines  [6],  etc.,  for  form  identification  and  registration.  Robust  detection  of  frame 
lines  is  crucial  in  these  approaches. 

For  line-based  image  registration,  model-based  approaches  are  often  used  to  find  the 
correspondence  between  the  detected  lines  and  those  stored  in  the  form  template  [1,8,30]. 
In  [30],  Tang  and  Suen  first  detect  an  anchor  line,  from  which  the  approximate  locations 
of  other  lines  can  be  determined  based  on  the  form  template.  Then  the  system  refines 
the  detection  by  searching  in  areas  around  these  locations.  The  correspondence  between 
the  detected  anchor  line  and  those  in  the  form  template  may  not  be  unique.  Cesarini 
et  al.  proposed  a  hypothesis  and  verification  paradigm  to  solve  the  correspondence  and 
detection  simultaneously  [1].  A  hypothesis  is  generated  for  each  correspondence,  and  then 
the  system  searches  the  expected  lines  in  the  surrounding  areas  to  verify  the  hypothesis. 
One  shortcoming  of  the  approach  is  that  the  hypothesis  and  verification  need  to  be  defined 
for  different  types  of  forms.  The  false  alarms  are  reduced  due  to  the  use  of  templates, 
but  miss  detection  of  broken  lines  is  reported  [30] . 

3  Preprocessing 

The  purpose  of  preprocessing  is  two-fold:  first  we  filter  out  text  strokes  so  their  effect 
can  be  diminished;  second,  we  deskew  the  document  so  the  parallel  lines  can  be  oriented 
horizontally  or  vertically. 
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Figure  2:  The  definition  of  a  horizontal  DSCC.  (a)  An  example  of  DSCCs;  (b)  extracted 
DSCCs  (represented  in  red  color)  where  a  text  stroke  crosses  a  line. 


3.1  Text  Filtering  Based  on  DSCC 

3.1.1  Definition  of  DSCC 

We  define  two  types  of  DSCCs:  the  horizontal  one  and  the  vertical  one,  which  are  used 
to  detect  horizontal  and  vertical  frame  lines  respectively  in  [35].  Take  the  horizontal 
DSCC  for  example:  a  horizontal  DSCC  Ch  consists  of  a  black  pixel  run-length  array 

•  •  •  Rm,  where  Ri  is  a  vertical  run-length  with  one  pixel  width 


Ri{xi,ysi,yei) 


P{x,  y)  =  1,  for  x  =  Xi,y  E  [ysi,  yci] 
and  p{xi,  ySi  -  1)  =  p{xi,  yet  -|-  1)  =  0 


(2) 


where  p{x,  y)  is  the  value  of  pixel  (x,  y)  with  1  representing  black  pixels,  and  0  repre¬ 
senting  white  pixels;  Xi,  ySi  and  ye*  designate  x,  starting  y,  and  ending  y  coordinates 
of  Ri  respectively.  Two  neighboring  run-lengths  Ri  and  are  merged  into  a  DSCC 
chain  if  they  are  singly  connected  in  the  horizontal  direction.  As  shown  in  Fig.  2,  the 
single  connection  means  that  at  each  side  of  Ri{l  <  i  <  m),  there  is  one  and  only  one 
connected  run-length.  In  this  example,  R1R2  •  ■  •  Rj,  R11R12R13,  Rs,  R9,  Rio,  Ru  and 
i?i5  are  extracted  as  DSCCs.  The  definition  of  vertical  singly  connected  chain  is  very 
similar  to  the  horizontal  one. 

The  most  important  property  of  a  line  is  the  single  connection  along  its  running 
direction.  An  ideal  line  consists  of  only  one  DSCC.  A  real  line  often  consists  of  multiple 
non-overlapping  DSCCs.  Fig.  2b  shows  an  example  of  extracted  DSCCs  (represented  in 
red  color)  of  a  text  stroke  crossing  a  line.  We  can  see  that  the  line  is  broken  into  several 
line  segments  (DSCCs)  on  the  touching  area.  If  the  image  quality  is  reasonable,  then  a 
line  can  be  detected  by  merging  those  DSCCs  with  the  similar  orientation  [35].  In  our 
case  we  use  it  to  remove  text  and  preserve  line  segments. 


3.1.2  Filtering 

As  shown  in  Fig.  2b,  a  DSCC  can  be  a  text  stroke  or  a  line  segment.  We  observed  that  a 
line  segment  often  has  a  smaller  variation  from  the  desired  orientation  and  larger  aspect 
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(c)  (d)  (e) 

Figure  3:  Text  filtering  using  DSCC.  (a)  A  document  image  with  rule  lines;  (b)  result 
of  text  filtering  in  horizontal  direction;  (c)  a  form  document  image;  (d)  result  of  text 
filtering  in  horizontal  direction;  (e)  result  of  text  filtering  in  vertical  direction. 


ratio.  We  use  an  ellipse  to  model  the  shape  of  a  DSCC,  and  calculate  the  orientation  6, 
the  hrst  and  second  axes  a  and  b  of  each  DSCC  as  follows 


1— 1 

1 

1 

II 

(3) 

9  =  O.Stan  ^  i - ^ 

V/^20  ~  /^02  / 

(4) 

2[/^20  +  M02  +  \l (^20  ~  ^02)^  +  4/iii] 

(5) 

2[ti20  +  7^02  ~  \J (7^20  ~  ^^02)^  +  4^11. 

(6) 

7^00 

where  I{x,y)  represents  a  pixel  in  the  DSCC,  x  and  y  are  the  means  of  x  and  y  coor¬ 
dinates,  and  Upq  is  the  central  moment.  For  horizontal  line  detection,  we  only  preserve 
those  DSCCs  with  either  very  small  sizes  (max{a,  b}  <  Ti),  or  large  aspect  ratios  within 
a  specihed  orientation  angle  {a/b  >  T2  and  0  G  [—45*’,  45'’]).  Ti  and  T2  are  thresholds 
determined  experimentally.  The  hrst  condition  preserves  small  DSCCs  which  can  be 
parts  of  a  broken  line,  or  the  touching  areas  of  lines  and  text;  and  the  second  one  pre¬ 
serves  large  DSCCs  which  are  likely  to  be  horizontal  line  segments.  Similar  conditions 
hold  for  vertical  lines  except  for  the  orientation  angle.  Fig.  3  shows  some  examples  of 
text  Hltering.  We  can  see  that  most  text  strokes  are  Hltered  and  the  line  segments  are 
preserved. 
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Figure  4:  Refinement  of  the  skew  angle  estimation,  (a)  Horizontal  projection  along  the 
skew  angle;  (b)  search  range. 


3.2  Skew  Angle  Estimation 

Our  skew  angle  estimation  is  based  on  the  line  segments  extracted.  After  filtering  text 
strokes,  we  merge  the  preserved  DSCCs  into  longer  line  segments  and  use  a  two-step 
coarse-to-fine  method  to  estimate  the  skew  angle.  In  the  coarse  estimation,  we  construct 
a  weighted  angle  histogram  of  all  extracted  lines  as  follows 

Clear  all  entries  of  the  histogram  h 
For  each  line  i  do 

9i  =  orientation  of  line  i 
Weight  Wi  =  length  of  line  i 

h{0i)  =  h{6i)  +  Wi 

End 

The  length  of  a  line  is  used  as  the  weight  so  the  long  lines  can  be  emphasized.  The  angle 
6c  with  the  largest  count  in  the  histogram  is  taken  as  the  coarse  estimate  of  the  skew 
angle.  We  refine  the  estimate  further  by  performing  projection  along  the  angles  within 
a  small  range  around  6c,  as  shown  in  Fig.  4.  The  angle  6  maximizing  the  variance  of  the 
projection  is  the  refined  estimate 
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arg  max 

ee[ec-i.2°  ,ec+i.2°] 


Var{h{y,d)) 


(7) 


where  h{y,  6)  is  the  projection  along  the  skew  angle  9.  Experiments  conducted  on  our 
database  containing  168  Arabic  documents  show  the  errors  of  coarse  skew  estimate  are 
within  the  range  of  [—1.16°,  1.17°],  and  reduced  to  [—0.56°,  0.18°]  after  refinement.  After 
skew  angle  estimation,  the  skew  distortion  is  corrected  before  we  perform  horizontal  or 
vertical  projection. 


4  HMM  Model-Based  Parallel  Line  Detection 

In  the  following  description,  we  use  horizontal  line  detection  as  an  example  to  illustrate 
the  proposed  method.  The  extension  to  vertical  line  detection  is  straightforward.  A 
stochastic  model,  M.{yi,  y2, . . . ,  yN),  is  proposed  for  a  group  of  parallel  lines,  where  N  is 


the  number  of  lines,  and  yi,i  =  1,  2, . . . ,  is  the  vertical  position  of  the  line.  The 
line  gap  Qi  between  two  neighboring  lines  i  and  i  +  1  is  defined  as 

9i  =  Vi+i  -  Vi  (8) 

As  we  show  later,  the  line  positions  =  1,2,  form  a  Markov  chain  if  the 

variations  in  line  gaps  are  independent.  As  they  are  not  observable  in  the  projection 
profile,  a  HMM  model  is  more  suitable  for  modeling  the  profile.  The  line  positions  can 
be  detected  by  decoding  the  HMM  model. 

4.1  Markov  Chain 

The  Markov  property  of  a  sequence  of  events  is  well  studied  in  the  literature  [27].  Consider 
a  system  that  stays  at  one  of  a  set  of  N  distinct  states,  S'!,  <S'2, . . . ,  at  any  sampling 
time  t.  It  undergoes  a  change  of  state  according  to  a  set  of  probabilities  associated  with 
the  state  during  the  period  between  two  successive  sampling  times.  For  a  Markov  chain 
(the  first  order),  the  probability  of  staying  at  state  qt  only  depends  on  the  previous  state 
Qt-i 

P[Qt  1  qt—2  ...,</!  ‘S'fei]  P[Qt  1  (9) 

If  the  state  transition  probability  is  independent  of  time  t,  then  the  Markov  chain  is  said 
to  be  homogeneous 

P[qt  =  Sj\qt-i  =  Si\  =  ttij  l<iJ<N  (10) 

We  can  show  that  line  positions  {Yi,i  =  1,2,  ...,W}  form  a  Markov  chain,  if  the 
variations  in  line  gaps  are  independent.  Here  we  use  uppercase  characters  to  represent 
random  variables  (e.g.  Fj)  and  lowercase  characters  to  represent  the  value  of  the  random 
variables  (e.g.  yi). 

Theorem  1:  Let  F),z  =  1,2,  ...,W  be  line  positions,  and  Gi  =  F)+i  —  Yi,i  = 
1, . . . ,  W  —  1  be  line  gaps.  If  {Gi}  are  independent,  then  {Ij}  form  a  Markov  chain. 

P(ij|ri,  F2,  . . . ,  i^*-i)  =  P{nY,_i)  (11) 

Proof: 

p(ij|ri,r2,...,F*-i)  =  P(G,_i  +  F,_i|ri,r2,...,i^*-i) 

=  P(G,_i|Fi,F2,...,F*-i) 

=  P{Gi-i\Gi,G2,  ■  ■  ■  ,Gi-2,Yi-i)  (12) 

Since,  {Gi}  are  independent,  we  have 

P(F,|ri,F2,...,l^*-i)  =  P(G', _i|lj_i) 

=  p(G',_i  +  y,_i|T,_i) 

=  p(y|y-i)  (13) 

Therefore,  {Fj}  form  a  Markov  chain. 

In  random  process,  {Fj}  is  called  an  independent  increment  process,  which  includes 
several  well  known  random  processes,  e.g.  Brownian  motion  process,  Poisson  process, 
etc  [7]. 
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4.2  HMM  Model 


In  many  applications,  the  actual  state  sequence  is  not  observable.  The  resulting  model 
(which  is  called  a  hidden  Markov  model)  is  a  doubly  embedded  stochastic  process  with 
an  underlying  stochastic  process  that  is  not  observable,  but  can  only  be  observed  through 
another  stochastic  process  that  produces  the  sequence  of  observations.  The  elements  of 
a  standard  discrete  HMM  model  are 

1)  iV,  the  number  of  the  states  in  the  model. 

2)  M,  the  number  of  distinct  observation  symbols  per  state. 

3)  H  =  {ttij},  the  state  transition  probability  matrix. 

4)  B  =  {bij},  the  probability  distribution  matrix  of  the  observation  symbols. 

5)  TT,  the  initial  state  distribution. 

HMM  models  can  model  some  1-D  signals  well,  and  have  achieved  great  success  in  speech 
[27]  and  handwriting  recognition  [26]. 

In  our  application,  we  can  only  observe  the  projection  prohle  hk 


P{Hk  —  hk\Yi  —  yi, . . . ,  Tat  —  yjv) 


P{Hk  =  k  =  Di)  A  line  is  on  k 
P{Hk  =  /ife|Vi,  k  ^  yi)  No  lines  are  on  k 


(14) 

Therefore,  the  projection  prohle  can  be  modeled  with  a  HMM  model.  A  standard  HMM 
model  is  shown  in  Fig.  5a,  where  St  and  Sb  are  the  states  representing  top  and  bottom 
image  borders,  Si^i,  i  =  1,2, . . . ,  N  represent  lines,  and  SG,i,  i  =  1,2, . . . ,  N  —  1  represent 
the  line  gaps  between  lines  i  and  i  +  1. 

One  weakness  of  conventional  HMMs  is  the  modeling  of  state  duration.  The  inherent 
duration  probability  distribution  pi{d),d  =  1,2, ... ,  associated  with  state  Sg^  is 


Pi(d)  =  (auY  ‘(1  -  Ojj) 


(16) 


where  an  is  a  self  transition  probability.  The  exponential  state  duration  distribution 
is  inappropriate  for  our  applications.  Instead  we  explicitly  model  the  duration  dis¬ 
tributions.  The  model  with  explicit  state  duration  is  shown  in  Fig.  5b,  where  the 
stochastic  property  of  the  model  is  incorporated  into  the  state  duration  distributions 
Prid),  Psid),  Pi{d),  i  =  1,2, . . . ,  N  —  1.  For  some  applications,  the  quality  of  the  model¬ 
ing  is  signihcantly  improved  when  explicit  state  duration  distributions  are  used  [19]. 


4.3  HMM  Model  Parameter  Estimation 

The  major  drawback  of  an  explicit  duration  HMM  model  is  that  it  signihcantly  increases 
computational  costs  for  model  training.  With  a  traditional  forward-backward  training 
algorithm  (a  type  of  EM  algorithm),  the  re-estimation  problem  for  a  variable  duration 
HMM  is  more  difficult  than  that  for  a  standard  HMM  [27].  Fortunately,  in  our  case, 
we  can  directly  get  the  HMM  parameters  from  groundtruth  since  the  states  explicitly 
correspond  to  image  components.  Therefore,  the  forward-backward  training  algorithm 
is  not  needed.  We  set  duration  probabilities  of  states  St  and  Sb  to  uniform  distribution 
within  a  range.  The  duration  probabilities  of  states  S'g,*,  i  =  1,2, . . . ,  N  —  1  are  estimated 
directly  from  the  groundtruth. 
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(b) 


Figure  5:  HMM  models  for  a  projection  profile,  (a)  A  standard  HMM  model;  (b)  a  HMM 
model  with  explicit  state  duration. 


The  observation  comes  from  the  projection  profile  {fife}-  The  large  number  of  ob¬ 
servation  symbols  would  prevent  us  from  estimating  the  model  parameters  reliably  with 
limited  training  samples.  There  are  two  methods  to  reduce  the  number  of  parameters  of 
the  model.  One  is  to  model  the  distribution  of  the  observation  as  a  Gaussian  distribu¬ 
tion  [27],  so  only  the  mean  and  variance  of  the  Gaussian  distribution  are  needed  to  be 
estimated.  For  known  form  processing,  we  find  the  projections  of  a  line  over  multiple 
form  instances  can  be  well  modeled  as  a  Gaussian  distribution.  Another  method  quan¬ 
tizes  the  projection  profile  into  several  levels.  For  rule  line  detection,  the  image  quality 
varies  significantly  among  different  images.  The  distribution  of  the  observation  does  not 
follow  the  Gauss  distribution.  Therefore,  we  quantize  hk  into  K  levels  (i7  =  5  in  our 
experiments  for  rule  line  detection).  The  probabilities  of  each  level  can  be  estimated 
from  the  groundtruth. 

The  HMM  parameters  estimated  directly  from  the  groundtruthed  data  set  are  not 
optimal  due  to  the  sparseness  of  the  training  data.  For  example,  some  entries  of  the 
line  gap  distribution  in  Table  1  do  not  appear  or  only  appear  a  few  times.  Parame¬ 
ter  sharing,  a  technique  used  in  Neural  Network  to  train  the  parameters  with  limited 
training  samples  [18,28],  is  used  in  our  approach.  For  example,  we  let  non-line  states 
St,  Sb,  Sg,i,  ■  ■  ■ ,  Sg,n-i  share  the  same  observation  probability  distributions  since  the 
observations  of  these  states  are  the  same:  the  projections  of  noise  and  remaining  text 
strokes  after  filtering.  For  rule  line  detection,  we  further  combine  all  line  states  into 
one  state,  and  all  non-line  states  into  another  state,  which  significantly  reduces  the  pa¬ 
rameters  of  the  model.  For  line  gap  distribution  estimation,  we  assume  the  distribution 
is  symmetric  around  the  mean  value.  Therefore,  data  smoothing  techniques,  originally 
proposed  in  natural  language  processing  [15],  can  be  used 


C'{g,  +  k)  =  C'{g,  -  k) 


C {gi  -\-  k)  -\-  C {gi  —  k) 


(16) 


where  gi  is  the  mean  value  of  line  gap  Gi,  C{k)  is  the  number  of  instances  of  Gi  with 
value  k  in  the  training  set,  and  G'{k)  is  the  smoothed  result  after  imposing  symmetric 
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regularization.  Finally,  we  set  the  empty  entries  to  the  minimal  value  of  all  non-zero 
entries.  Suppose  the  maximal  variation  of  the  line  gap  Gi  is  K.  For  k  G  [—K,K],  the 
final  smoothed  result  is 


C"{gi  +  k) 


C'{g^  +  k)  iiC'{g,  +  k)^0 

C'(gi  +  i)  it  C'(gi  +  k)  =  0 


(17) 


C"{k)  can  be  converted  to  probability  by  normalization.  We  will  illustrate  the  data 
smoothing  with  examples  in  the  following  section. 

The  ultimate  goal  of  training  is  to  search  the  optimal  HMM  model  parameters  to 
minimize  the  line  detection  error.  The  estimated  parameters  from  the  training  data  can 
produce  reasonable  results,  however  they  do  not  minimize  the  line  detection  error  rate. 
Generally  the  error  criterion  is  a  very  complex  function  of  the  model  parameters  without 
a  closed  form.  A  direct  searching  algorithm  can  be  used  to  solve  such  optimization 
problems.  In  our  case,  the  simplex  search  method  proposed  by  Nelder  and  Mead  is  used 
to  minimize  the  detection  error  [25]. 


4.4  Decoding  of  the  HMM  Model 

Given  the  observation  sequence  O  =  hk,k  =  1,2, ...  ,T,  and  the  HMM  model  A,  we 
want  to  search  an  optimal  state  sequence  Q  =  qiq2  . .  .qr  to  maximize  P{Q\0,  A),  which 
is  equivalent  to  maximizing  P{Q,0\X).  Normally,  the  Viterbi  algorithm,  a  dynamic 
programming  method,  is  used  to  decode  HMM  models.  A  matrix  v  with  dimension 
T  X  {N  +  1)  is  defined  and  updated  in  the  Viterbi  algorithm,  and 

v{t,n)=  max  P[qi,q2, . . .  ,qt  =  SL,n,hi,h2, . . .  ,ht\X\  (18) 

11,12,— ,qt-i 

is  the  best  decoding  score  at  time  t,  which  accounts  for  the  first  t  observations  and  ends 
in  state  The  sequence  qi,q2,  ■  ■  ■ ,  qt-i  maximizing  the  probability  in  Eq.  (18)  is  the 
best  decoding  result  until  time  t  if  we  decode  state  qt  as  the  line. 

Suppose  the  minimal  and  maximal  state  durations  of  states  SG,n  are  and  5^+,  and 
the  durations  of  St  and  Sb  are  uniformly  distributed  in  [0,  hr]  and  [0,5^],  respectively. 
The  complete  procedure  of  decoding  is  stated  as  follows 

1.  Clear  all  entries  of  matrix  v. 

2.  For  1  <  i  <  5t,  decode  the  first  i  —  1  observations  as  St  (the  top  image  border) 
and  observation  i  as  S^^i 

=  r  \.P{hi\qi  =  Sl,i)  n  =  ^t),  (19) 

dT  +  1 

where  P{hi\qi  =  Sl,i)  is  the  probability  of  observing  hi  if  the  system  enters  state 
Sl,i  at  time  v,  ~  ^t)  is  the  probability  of  observing  the  first  i  —  1 

observations  if  the  system  stays  at  state  St  during  the  time  period  from  1  to  i  —  1; 
and  is  the  probability  of  the  model  staying  at  St  for  i  —  1  consecutive  periods. 
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3.  Set  t  =  1. 


4.  For  n  =  1  to 

For  j  =  Sn-  to  Sn+ 

v'(t  +  j,n)  =  v(t,n)Pr^(j)F(ht+jjqt+j  =  SL,n+i)  H  =  5'G,n)  (20) 

/c=t-|- 1 

v(t  +  j,n)  =  max{v(t  +  j,  n),  v'(t  +  j,n)}  (21) 


End  loop  of  j 
End  loop  of  n 

Here  Pn{j)  is  the  probability  of  staying  at  state  SG,n  with  j  consecutive  times; 
P{ht+j\qt+j  =  SL,n+i)  is  the  probability  of  observing  observation  ht+j  if  the  system 
enters  Si^n+i  at  time  t  +  j,  which  corresponds  to  a  new  line;  and  nltft+i  P{hk\qk  = 
SG,n)  is  the  probability  of  observing  sequence  ht+i  to  ht+j-i  if  the  system  stays 
at  state  SG,n  during  this  time  period,  which  corresponds  to  a  line  gap.  Eq.  (21) 
updates  the  optimal  partial  detection  result. 


5.  If  t  >  T  —  Sb,  decode  the  following  sequence  as  the  bottom  image  border. 


v'iT,N  +  l)  =  v{t,N)-^  n  Pihk\qk  =  SB)  (22) 

Ob  +  1  k=t+i 

v{T,N  +  l)  =  max{n(T,iV  +  l),  v'{T,N  +  l)}  (23) 


6.  If  t  <  T,  then  t  =  t  +  1,  and  go  to  step  4. 

Eor  each  t,  the  algorithm  remembers  the  best  decoding  path  until  time  t.  After 
decoding, 

n(r,  iV  +  1)  =  maxP((5,  0|A)  (24) 

Q 

is  the  probability  of  detecting  lines  given  the  model,  which  can  be  regarded  as  detection 
confidence.  The  sequence  qi,q2, . . .  ,qT  which  achieve  v{T,  N  +  1)  is  the  optimal  decoding 
result. 


4.5  Polyline  Representation 

After  identifying  the  vertical  position  of  a  line,  our  next  step  is  to  detect  the  left  and 
right  end  points  by  grouping  the  broken  line  segments  together.  For  each  detected  line, 
those  DSCCs  within  10  pixels  distance  to  the  detected  line  are  merged  [35].  Those  lines 
with  less  than  50  pixels  are  removed. 

An  ideal  straight  line  can  be  represented  with  two  parameters  a  and  b  as  y  =  ax  x  +  b. 
Practically,  a  real  line  is  represented  with  points  =  1,2, ...  ,n.  The  parameters 

a  and  b  can  be  estimated  based  on  the  minimum  mean  squared  error  criterion  (MMSE) 

n  n 

x  =  J2^i/n  y  =  Y.yi/^ 

i=l  i=l 
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^  E?=i(i  -x)(y-  y) 

T.U(x-xY 

b  =  y  —  a  X  X  (25) 

For  most  straight  lines,  this  approximation  is  good  enongh.  However,  dne  to  the  dis¬ 
tortion  introduced  by  photocopying  and  scanning,  some  lines  are  cursive,  and  can  not 
be  represented  by  two  end  points  well.  In  this  case,  a  polyline  representation  is  used  as 
follows 

1.  Calculate  the  average  approximation  error  of  a  line 

^Vi  =  IVi  -  a  X  Xi  -  b\  (26) 

n 

e  =  '^Syi/n  (27) 

i=l 

2.  If  e  is  smaller  than  the  average  line  width  (often  2-4  pixels),  keep  it  with  two  end 
points  representation,  and  exit. 

3.  Otherwise,  split  the  whole  line  into  two  segments  from  the  middle,  and  estimate 
the  line  parameters  a  and  b  for  each  segment  respectively,  as  described  in  Eq.  (25). 

4.  For  each  segment,  go  to  step  1  and  repeat. 

A  polyline  is  described  as  a  sequence  of  vertices  (Pi,  P2, . . . ,  Pm)-  Two  or  three  segments 
are  sufficient  to  represent  most  lines  in  our  following  experiments. 

5  Applications 

5.1  Known  Form  Processing 

The  application  of  the  algorithm  to  known  form  processing  is  straightforward.  Generally 
there  are  a  group  of  horizontal  and  vertical  parallel  lines  respectively  on  a  form.  There¬ 
fore,  we  use  two  HMM  models  to  detect  the  horizontal  and  vertical  lines  separately.  To 
apply  the  algorithm,  we  need  to  estimate  two  sets  of  parameters:  1)  The  distribution 
of  the  observation  symbols  of  each  state  (B  matrix  in  Section  4.2);  and  2)  The  state 
duration  probabilities  of  each  state. 

5.1.1  Estimation  of  the  Distributions  of  Observation  Symbols 

In  our  case,  the  observation  symbols  are  the  projection  prohle,  which  is  in  the  range 
of  [0,  tc]  (where  w  is  the  width  of  the  image).  As  we  addressed  in  Section  4.3,  a  large 
number  of  observation  symbols  would  cause  difficulties  in  estimating  the  distributions 
reliably  with  limited  training  samples.  Under  some  assumptions,  we  can  show  that  using 
the  Gaussian  distribution  to  model  projections  of  a  line  over  multiple  form  instances  is 
appropriate.  In  stochastic  document  image  degradation  models,  a  white  (black)  pixel  is 
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randomly  selected  and  flipped  to  black  (white)  [16].  The  projection  is  the  summation  of 
all  black  pixels  on  the  line 

N 

h  =  '£ai  (28) 

i=l 


where 

_  J  1  if  black  pixel  i  is  preserved 
*  1  0  if  black  pixel  i  is  flipped  to  white  during  degradation 


(29) 


Under  white  Gaussian  noise,  a  widely  used  model  for  degradation,  a*  follows  a  Bernoulli 
distribution:  a*  ^  Bernoulli{p),  where  p  is  the  probability  for  a  black  pixel  to  be  lost. 
Consequently,  h  follows  a  binomial  distribution  Bin{p,  N) 


(30) 


According  to  the  central  limit  law,  if  N  is  large  enough  (or  if  the  line  is  long  enough), 
then  the  distribution  of  random  variable  h  converges  to  a  Gaussian  distribution  [7] 


,.  h-  E[h] 
hm  , 

^Np{l  -  p) 


A/'(0, 1)  in  distribution 


(31) 


In  known  form  processing,  generally,  a  set  of  forms  are  captured  with  the  similar  imaging 
conditions.  Therefore  p  is  roughly  consistent  for  each  form  instance  in  the  set.  So  a 
Gaussian  distribution  is  a  good  approximation  for  the  projections  of  a  line  over  multiple 
form  instances.  The  mean  and  variance  of  the  Gaussian  distribution  can  be  estimated 
from  the  groundtruth.  Figs.  6a  to  6f  show  the  distributions  of  the  projections  of  all 
six  horizontal  lines  on  a  set  of  bank  deposit  forms  with  one  instance  shown  in  Fig. 
la.  The  histogram  is  generated  over  100  form  samples.  We  can  see  that  the  Gaussian 
distribution  is  a  good  approximation.  For  non-line  states,  the  approximation  is  not  good 
enough  since  a  projection  is  always  larger  than  zero,  as  shown  in  Fig.  6g.  But  we  found 
in  the  experiments  that  the  affect  of  this  approximation  error  is  negligible  for  the  final 
line  detection  result. 


5.1.2  Estimation  of  the  State  Durations 

The  state  duration  of  So^i,  f  =  l,2,...,A^  —  1,  represents  the  line  gap  between  lines  i  and 
i  -\-l,  which  can  be  estimated  from  the  groundtruth.  Table  1  shows  the  distribution  of 
the  gap  between  the  first  and  second  horizontal  lines  on  the  bank  deposit  form  in  our 
database  with  100  samples.  The  average  value  of  the  gap  is  94  pixels.  The  row  of  distance 
lists  the  difference  to  the  average  value.  The  row  of  raw  occurrence  shows  the  number 
of  occurrences  the  gap  takes  a  specific  value.  We  can  see  that  the  variation  is  from  -9 
pixels  to  6  pixels,  and  the  distribution  is  roughly  symmetric  around  the  average  value. 
Due  to  the  sparse  data  problem,  some  entries  within  the  range  of  [-9,  6]  are  not  observed 
in  the  training  set,  which  will  deteriorate  the  performance.  Therefore,  data  smoothing  is 
used.  The  row  of  symmetric  regularization  is  the  result  after  we  impose  the  symmetry 
(Eq.  (16)).  At  last,  we  set  the  zero  entries  to  the  minimal  value  of  all  non-zero  entries,  as 
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Figure  6:  The  distributions  of  the  observation  symbols  (horizontal  projections)  for  100 
scanned  instances  of  a  bank  deposit  form.  One  example  of  the  form  is  shown  in  Fig.  la. 
There  are  six  horizontal  lines  on  the  form,  (a)  to  (f)  the  histograms  of  the  projections 
of  six  horizontal  lines  respectively;  (g)  the  histogram  of  the  projections  of  the  non-line 
states. 


Table  1;  The  distribution  of  the  line  gap  between  the  first  and  second  horizontal  lines  on 
a  bank  deposit  form.  The  average  is  94  pixels.  The  row  of  distance  lists  the  difference 
to  the  average  value. _ 
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shown  in  the  row  of  zero- occurrence  smoothing.  After  data  smoothing,  the  distribution 
Pi  (d)  can  be  estimated  by  normalization.  Similarly,  we  can  get  the  distributions  of  other 
line  gaps. 

5.1.3  Decoding 

After  estimating  parameters,  we  use  the  Viterbi  algorithm  described  in  Section  4.4  to 
decode  the  observation.  Fig.  7  shows  the  decoding  results  of  the  Viterbi  algorithm 
on  the  horizontal  and  vertical  projection  profiles  of  the  bank  deposit  form  (Fig.  la). 
The  locations  picked  up  by  the  Viterbi  algorithm  are  labeled  with  red  squares.  We  can 
see  that  instead  of  picking  the  highest  peaks  as  detected  lines  in  the  projection  based 
methods  [2,21],  our  approach  outputs  the  line  positions  which  are  most  compatible  with 
the  model. 

After  detecting  the  horizontal  and  vertical  lines,  the  method  described  in  Section 
4.5  can  be  used  to  determine  the  end  points  of  the  lines.  However,  if  a  line  is  severely 
degraded,  the  end  points  can  not  be  determined  accurately.  In  form  processing,  the  cross 
points  of  horizontal  and  vertical  lines  can  be  used  to  determine  the  end  points.  Short  lines 
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Figure  7;  The  lines  detected  after  decoding  the  HMM  models  using  the  Viterbi  algorithm 
on  the  horizontal  (a)  and  vertical  (b)  projection  profiles  of  the  bank  deposit  form.  The 
original  form  is  shown  in  Fig.  la.  The  locations  picked  up  by  the  Viterbi  algorithm 
are  labeled  with  red  squares.  We  can  see  that  instead  of  picking  the  highest  peaks  as 
detected  lines  in  the  projection  based  methods,  our  approach  outputs  the  line  positions 
which  are  most  compatible  with  the  model. 


lying  on  the  same  line  (i.e.,  three  dotted  lines  in  the  middle  of  the  form,  as  shown  in  Fig. 
la)  are  merged  as  one  line  in  the  HMM  model.  After  detecting  all  lines,  each  individual 
lines  can  be  obtained  using  the  geometric  relationship  of  the  end  points  to  other  lines. 
Fig.  Ic  shows  the  model  based  line  detection  result.  We  can  see  our  method  can  even 
detect  the  short  lines  which  do  not  form  peaks  on  the  projection  profile  (especially  for  the 
two  shortest  vertical  lines)  which  are  most  likely  missed  by  other  methods,  such  as  the 
Hough  transform  or  projection  based  methods.  Our  method  outputs  the  exact  number 
of  lines  indicated  by  the  model  without  false  alarms.  Fig.  8  shows  two  more  examples 
of  an  export  registration  form  used  by  the  Customs  Bureau  of  China  and  a  portion  of  a 
U.S.  income  tax  return  form. 

Another  advantage  of  our  HMM  based  form  processing  approach  is  that  it  can  be 
extended  for  form  identification  easily.  Suppose  there  are  n  form  templates  Ai,  A2, . . . ,  A„. 
According  to  the  Bayesian  rule,  A  which  maximizes  the  posteriori  probability  is  selected 
as  the  template  for  the  input  form 

A  =  ar^maxP(Aj|0)  =  ar^  max  P(0|Aj)P(Aj)  (32) 

where  O  is  the  observation  (the  projection  profile  in  our  method),  P{Xi)  is  the  priori 
probability  of  form  template  A^,  and  P(0|A)  is  the  probability  of  observing  the  sequence 
of  observations  given  the  model,  which  can  be  calculated  efficiently  with  the  forward 
algorithm  [27]. 
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Figure  8;  Some  examples  for  model-based  form  frame  line  detection,  (a)  and  (b)  An 
export  registration  form  used  by  the  Customs  Bureau  of  China  and  the  corresponding 
line  detection  result,  (c)  and  (d)  A  portion  of  a  U.S.  income  tax  return  form  and  the 
corresponding  line  detection  result.  Red  color  shows  the  detected  lines. 


5.2  Rule  Line  Detection 

In  this  section  we  use  the  method  to  detect  severely  broken  rule  lines.  Different  from 
the  previous  case,  the  number  of  lines  is  unknown,  and  the  vertical  line  gaps  may  vary 
in  different  images  due  to  the  different  style  of  the  rule  lined  paper,  or  different  scanning 
resolutions.  However  the  length  of  lines  and  the  vertical  line  gaps  are  roughly  consistent 
in  the  same  document  image. 

5.2.1  Vertical  Line  Gap  Estimation 

We  need  to  estimate  the  average  vertical  line  gap  from  the  input  image  hrst.  Since  the 
line  gaps  between  neighboring  lines  are  roughly  the  same,  the  horizontal  projection  of  rule 
lines  is  a  periodic  signal  (the  period  is  the  average  vertical  line  gap  g).  We  use  an  auto¬ 
correlation  based  approach  to  estimate  the  period  of  the  projection.  The  auto-correlation 
of  a  signal  x,  with  n  samples  x{l),x{2),  •  •  • ,  x{n),  is  defined  as 

n—l 

R{1)  =  '^x{i)x{i -\- 1)  /  =  0, 1, . . . ,  n  —  1  (33) 

i=l 

The  distance  between  the  hrst  two  peaks  of  the  auto-correlation  is  taken  as  the  vertical 
line  gap,  as  shown  in  Fig.  9. 

To  show  the  accuracy  of  the  method,  we  compare  the  estimate  with  the  actual  vertical 
line  gap  of  the  groundtruthed  lines.  The  page  level  estimation  error  is  dehned  as  the 
average  of  the  estimation  errors  of  all  vertical  line  gaps  on  a  document  page.  For  168 
images  in  our  database,  most  page  level  estimation  errors  are  within  0.5  pixels.  The 
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Figure  9:  Vertical  line  gap  estimation  for  rule  line  detection  based  on  the  auto-correlation 
of  the  projection  profile. 
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Figure  10;  A  simplified  HMM  model  for  rule  line  detection. 

maximum  page  level  estimation  error  is  1.3  pixels  due  to  the  large  vertical  line  gap 
variance. 

5.2.2  A  Simplified  Model 

In  order  to  reduce  the  complexity  of  the  model  (the  number  of  states  and  parameters), 
we  further  simplify  it  by  considering  the  special  properties  of  rule  lines.  Since  the  vertical 
line  gaps  and  the  lengths  of  rule  lines  are  roughly  consistent  in  the  same  document  image, 
we  can  merge  states  SG,i,  i  =  1,2, . . . ,  N  —  1  into  one  state  So,  and  3^,1,  i  =  1,2, . . . ,  N 
into  another  state  S^.  The  simplified  model  is  shown  in  Fig.  10.  State  merging  reduces 
the  number  of  parameters  significantly.  Another  advantage  of  such  simplification  is  that 
we  do  not  need  to  explicitly  know  how  many  lines  are  on  a  document. 

5.2.3  Parameter  Estimation 

In  our  data  set,  the  quality  of  different  images  are  significantly  different  from  each  other. 
And  the  degradation  of  rule  lines  are  different  on  the  same  image  too.  Therefore,  we  can 
not  use  the  Gaussian  distribution  to  model  the  projections  of  rule  lines  (the  mixed  Gaus- 
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Table  2:  Observation  probability  distribution  matrix  B  estimated  from  the  training  set 
containing  100  documents _ 
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sian  distribution  may  be  a  good  approximation).  Instead,  we  quantize  the  observation 
into  several  levels,  and  estimate  the  probability  of  each  quantized  level  directly  from  the 
groundtruthed  data  set.  Peaks  on  the  projection  prohle  are  of  particular  significance  for 
line  detection.  Therefore,  we  first  set  all  non-peaks  on  the  profile  to  zero,  then  quantize 
the  peaks  on  the  projection  profile  into  four  levels  using  the  following  quantization  levels: 
tc/lG,  tc/S,  and  ty/4,  where  w  is  the  image  width.  The  observation  probability  distri¬ 
bution  matrix  B,  estimated  from  the  training  set  containing  100  documents,  is  listed  in 
Table  2.  Here,  we  let  states  St,Sb,  and  Sq,  whose  observations  are  the  projection  of 
text  or  noise,  share  the  same  observation  distribution.  We  observed  that  1)  due  to  the 
severe  brokenness,  the  horizontal  projections  of  about  80%  rule  lines  are  less  than  1/4 
of  the  image  width;  2)  4.7%  rule  lines  do  not  form  peaks;  and  3)  the  peaks  with  small 
heights  are  more  likely  formed  by  text  strokes  or  noise  (2,052  instances),  rather  than  by 
rule  lines  (246  instances).  Therefore  we  need  to  use  high  level  contextual  information  to 
achieve  reasonable  detection  results  for  these  severely  broken  lines. 

We  set  duration  probability  of  states  St  and  Sb  to  the  uniform  distribution  on  [0,  g  — 
1].  The  duration  probability  of  state  So  is  estimated  directly  from  the  groundtruth 
with  the  same  method  discussed  in  Section  5.1.2.  With  all  these  settings,  the  rule  line 
detection  accuracy  on  the  training  set  is  about  95.6%.  For  comparison,  the  accuracy  is 
only  91.7%  if  we  use  the  Gaussian  distribution  to  approximate  the  observation.  Since  the 
parameters  estimated  from  the  training  data  are  not  optimal  for  the  ultimate  detection 
error  criterion,  the  simplex  method  proposed  by  Nelder  and  Mead  [25]  is  used  to  search 
the  optimal  parameter  set  which  minimizes  the  detection  error.  Among  parameters  of 
our  model,  we  only  optimize  the  observation  probability  matrix  B.  Experiments  show 
the  detection  accuracy  increases  to  97.3%  on  the  training  set  after  optimization. 

5.2.4  Examples 

Fig.  If  shows  the  model-based  line  detection  result  for  a  rule  lined  document.  Compared 
with  Fig.  le,  we  can  see  that  with  contextual  information  the  result  is  significantly 
improved.  Our  model-based  method  is  very  robust  even  when  the  input  images  do  not 
follow  the  model  exactly.  Fig.  11  shows  an  example:  two  pages  are  overlapped  during 
scanning.  Our  algorithm  still  detects  all  rule  lines  correctly.  In  Fig.  12a,  we  remove 
35  rows  of  the  image  (about  half  of  the  average  vertical  line  gap  of  this  document). 
The  variation  of  the  line  gap  is  out  of  the  range  allowed  by  the  HMM  model.  The 
corresponding  detection  result  is  shown  in  Fig.  12b,  with  only  one  line  missed  due  to  the 
anomalous  vertical  line  gap. 
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Figure  11:  (a)  The  original  image  with  two  pages  overlapped  during  scanning;  (b)  the 
corresponding  line  detection  result. 


(a)  (b) 


Figure  12:  An  example  of  a  model  mis-match.  (a)  A  document  image  with  35  image 
rows  removed;  (b)  the  corresponding  line  detection  result. 


6  Experiments 

In  this  section,  we  quantitatively  evaluate  the  robustness  of  our  line  detection  algorithm, 
and  compare  it  with  several  non-model-based  algorithms.  We  present  our  evaluation 
metrics  in  detail  hrst. 


6.1  Line  Detection  Evaluation  Protocol 


Line  detection  accuracy  can  be  evaluated  at  the  pixel  level  and  the  line  level  [23].  At  pixel 
level  we  compare  the  difference  of  the  pixels  between  groundtruthed  and  detected  lines. 
It  is  straightforward  and  objective,  but  groundtruthing  at  the  pixel  level  is  extremely 
expensive  when  lines  are  broken,  distorted  and/or  overlapped  with  text.  Therefore,  we 
evaluate  the  algorithm  at  the  line  level.  Our  evaluation  metrics  is  based  on  the  Hausdorff 
distance.  The  Hausdorff  distance  between  two  point  sets  is 

=  max{/i(A,H),/i(H,A)}  (34) 


where 


h{A,B) 


=  max  min  I  la  —  61 1 

a^A  beB 


(36) 


and  ||.||  is  an  underlying  norm  (e.g.,  the  L2  or  Euclidean  distance).  The  function  h{A,  B) 
is  called  the  directed  Hausdorff  distance  from  A  to  B.  It  identifies  the  point  a  E  A  that  is 
the  farthest  from  any  point  of  B  and  measures  the  distance  from  a  to  its  nearest  neighbor 
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Figure  13:  Hausdorff  distance  between  two  polylines. 


in  B  [12].  The  direct  computation  method  for  the  Hausdorff  distance  is  time  consuming. 
However,  for  polyline  representation,  the  Hausdorff  distance  can  be  easily  calculated. 
Suppose  polylines  A  and  B  are  represented  as  a  sequence  of  vertices  (Hi,  H2, . . . , 
and  (Hi,  H2, . . . ,  Bn)  respectively,  then  the  Hausdorff  distance  is  simplified  as 

H{A,  B)  =  max{H'(H,  H),  e(H,  B)}  (36) 

where 

H'{A,B)  =  max{DAi,  Da2,  ■  ■  ■ ,  DAm,  Dbi,  Db2,  ■  ■  ■ ,  Bbti}  (37) 

e{A,B)  =  max{||Hi  -  Hill,  ||H^  -  H„||}  (38) 


DAi  is  the  perpendicular  distance  from  Ai  to  polyline  H,  and  Dsi  is  the  perpendicular 
distance  from  H*  to  polyline  A,  as  shown  in  Fig.  13.  H'{A,  B)  in  Eq.  (37)  is  the 
perpendicular  distance  between  two  polylines  A  and  H,  which  evaluates  the  accuracy  in 
determining  the  vertical  location  of  a  horizontal  line  and  the  horizontal  location  of  a 
vertical  line.  \\Ai  —  Bj\\  is  the  Euclidean  distance  between  points  H*  and  Bj.  Suppose 
the  vertices  of  a  polyline  are  sorted  from  left  to  right  for  a  horizontal  line,  and  top  to 
bottom  for  a  vertical  line.  Then  ||Hi  —  Hi||  and  \  \Am  —  H„||  are  the  end  point  distances. 
Hausdorff  distance  H{A,  B)  in  Eq.  (36)  combines  perpendicular  distance  and  end  point 
distance  into  one  metric. 

For  severely  broken  lines,  however,  it  is  very  hard  to  define  the  end  points  exactly. 
Therefore,  we  prefer  to  use  two  separate  metrics;  perpendicular  distance  and  end  point 
distance  for  evaluation,  instead  of  a  combined  Hausdorff  distance. 

The  absolute  value  of  the  end  point  distance  is  not  suitable  for  evaluating  both  short 
and  long  lines.  As  a  supplemental  metric,  the  overlapping  rate  of  polylines  A  and  H 


o(H,H) 


min{H^,  H„}  -  max{Hi,  Hi} 
max{Am,  Bn}  -  min{Hi,  Hj 


(39) 


is  dehned  to  evaluate  the  relative  end  point  determination  error. 

For  evaluation,  we  need  to  hnd  the  one-to-one  correspondence  between  the  detected 
and  groundtruthed  lines  first.  The  method  proposed  in  [20]  is  used  to  find  such  correspon¬ 
dence.  Suppose  the  detected  lines  are  La^i,  i  =  1,2, . . .  ,m,  and  the  groundtruthed  lines  are 
Lgj,j  =  1,2, . . .  ,n.  We  want  to  find  an  optimal  one-to-one  mapping  F  :  >  {Lgj}, 

such  that 

'^dist{Ld,i,Lg^F{i))  (40) 
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Figure  14:  Histogram  of  the  distances  of  groundtruth  rule  lines  to  the  corresponding 
detected  lines. 

is  minimized.  The  distance  is  the  Hausdorlf  distance  dehned  above.  The  optimal  mapping 
can  be  achieved  efficiently  using  the  Hungarian  algorithm  [20].  For  line  detection,  a 
correctly  detected  line  should  not  be  far  away  from  its  corresponding  groundtruthed  line. 
Let  Djnax  be  the  threshold  of  the  maximal  tolerable  localization  error.  The  optimal 
matching  problem  in  Eq.  (40)  is  converted  into  a  constrained  optimization  problem 

dist{Ld,i,  Lg^F(i))  <  Umax  for  7  =  1,  2, . . . ,  m  (41) 

If  the  numbers  of  detected  and  groundtruthed  lines  are  not  equal,  we  add  some  dummy 
lines  L dummy,  such  that 

dist(^Ld,i,  Ldummy^  Djuax  fo^"  ^  1,  2,  .  .  .  ,  777. 

dist(^L(iummy ,  ^g,j)  —  d^max  fo^  j  —  1,  2,  .  .  .  ,  77  (42) 

The  missed  groundtruthed  lines  and  the  falsely  detected  lines  can  be  matched  to  the 

dummy  lines.  If  the  distance  of  a  pair  of  matched  lines  is  less  than  Dmin,  then  the  line 

is  correctly  detected;  otherwise  it  is  partially  detected  with  a  large  location  error.  As 
suggested  in  [17],  we  set  Dmin  =  5  and  Dmax  =  10  pixels  respectively. 

6.2  Evaluation  of  Rule  Line  Detection 

We  obtained  168  Arabic  document  images  with  a  total  of  3,870  groundtruthed  rule  lines, 
most  of  which  are  severely  broken.  We  use  100  images  to  train  the  HMM  model,  and  the 
remaining  68  images  as  the  test  set.  The  detection  results  are  shown  in  Table  3.  On  the 
test  set,  96.8%  lines  are  detected  correctly  and  only  2  lines  are  missed.  The  histogram 
of  the  distances  for  all  matched  pairs  is  shown  in  Fig.  14.  We  can  see  that  most  of 
the  detected  lines  (more  than  90%)  match  the  corresponding  lines  within  3  pixels.  The 
false  alarm  rate  is  about  2.3%.  Most  of  the  false  alarms  are  caused  by  the  inconsistency 
between  the  detector  and  the  subjective  judgment  of  the  groundtruther  when  lines  are 
severely  broken. 

For  correctly  detected  lines  we  evaluate  the  end  point  determination  accuracy  using 
the  end  point  distance  and  overlapping  rate  defined  in  Eq.  (38)  and  (39)  respectively. 
The  average  end  point  distance  is  6  pixels  and  the  overlapping  rate  is  99.1%. 

Eor  comparison,  we  compared  our  model-based  line  detection  algorithm  with  other 
non-model-based  line  detection  algorithms:  the  Hough  transform  method,  the  projection 
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Table  3:  Quantitative  evaluation  of  the  rule  line  detection  result. 


Groundtruthed  Lines 

Detected  Lines 

Correct 

Partial  Correct 

Missed 

False  Alarm 

Training  Set 

2,274 

2,319 

2,212  (97.3%) 

56  (2.5%) 

6  (0.3%) 

51  (2.2%) 

Test  Set 

1,596 

1,631 

1,545  (96.8%) 

49  (3.0%) 

2  (0.1%) 

37  (2.3%) 

Table  4:  Comparison  of  our  model-based  method  with  other  methods  on  the  test  set 


(there  are  a  total  of  1,596  groundtruthed  lines). 


Detected  Lines 

Correct 

Partial  Correct 

Missed 

False  Alarm 

Hough  Transform 

1,747 

1,354  (84.8%) 

113  (7.1%) 

129  (8.1%) 

280  (17.5%) 

DSCC 

2,162 

1,398  (87.6%) 

118  (7.4%) 

o 

lO 

o 

00 

646  (40.5%) 

Projection  Method  1 

1,969 

1,384  (86.7%) 

117  (7.3%) 

95  (6.0%) 

468  (29.3%) 

Projection  Method  2 

4,387 

1,472  (92.2%) 

77  (4.8%) 

47  (2.9%) 

2,838  (177.8%) 

Our  Model-Based  Method 

1,631 

1,545  (96.8%) 

49  (3.0%) 

2  (0.1%) 

37  (2.3%) 

based  method,  and  the  DSCC  method.  The  line  detection  results  on  the  test  set  with 
different  algorithms  are  shown  in  Table  4.  To  reduce  the  false  alarm  rate  of  the  Hough 
transform  method,  we  restrict  the  search  range  of  0  around  0°  after  skew  correction.  For 
any  pair  of  detected  lines  which  are  too  close  to  each  other,  we  remove  the  one  with 
fewer  black  pixels.  The  accuracy  of  the  Hough  transform  method  is  about  84.8%.  Two 
variations  of  the  projection  method  are  implemented  and  compared.  Projection  methods 
1  and  2  in  Table  4  correspond  to  with  and  without  close  line  removal  respectively.  We 
can  see  that  without  close  line  removal,  the  accuracy  is  as  high  as  92.2%,  but  the  false 
alarm  rate  is  also  very  high  (177.8%).  Close  line  removal  signihcantly  reduces  the  false 
alarm  rate  to  29.3%,  but  the  accuracy  drops  to  86.7%  as  well.  For  the  DSCC  method, 
we  restrict  the  merging  direction  to  the  horizontal  direction.  Since  two  DSCCs  can  be 
merged  only  when  their  heights  are  compatible  [35],  the  false  alarm  rate  of  the  DSCC 
method  is  much  lower  than  the  projection  and  Hough  transform  methods  without  close 
line  removal.  Therefore,  we  did  not  apply  close  line  removal  to  the  result  of  the  DSCC 
method.  As  expected,  our  model  based  method  achieved  much  better  results  in  both 
accuracy  and  false  alarm  rate,  due  to  the  high  level  constraints  between  neighboring 
lines  used.  We  observed  that  our  model-based  method  can  achieve  higher  accuracy  than 
the  projection  method  without  close  line  removal,  indicating  that  some  severely  broken 
rule  lines  are  detected  successfully  even  when  they  do  not  form  peaks  on  the  projection 
prohle. 

6.3  Evaluation  of  Known  Form  Processing 

To  evaluate  the  algorithm  for  known  form  processing,  we  scanned  100  bank  deposit 
forms  with  reasonable  image  quality.  We  found  in  the  experiment  that  only  a  few  training 
samples  are  needed  to  achieve  good  result  if  the  image  quality  is  reasonable.  We  randomly 
selected  5  samples  for  training,  and  found  that  all  the  detected  lines  are  within  8  pixels 
to  the  groundtruth  without  false  alarms.  Then  we  test  the  robustness  of  our  method  to 
image  degradations.  We  randomly  flip  a  certain  ratio  of  black  pixels  on  lines  to  white, 
and  keep  all  pixels  on  text  unchanged.  Generally,  the  more  severe  the  degradation  is. 
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Figure  15;  Degraded  form  documents  and  the  corresponding  line  detection  results.  Red 
lines  drawn  on  original  images  indicate  the  detected  lines,  (a)  About  50%  of  the  pixels  of 
the  lines  are  flipped;  (b)  line  detection  result  of  (a);  (c)  about  90%  of  pixels  of  the  lines 
are  flipped;  (d)  line  detection  result  of  (c). 


the  more  accurate  the  model  should  be  in  order  to  detect  lines  correctly.  Therefore, 
in  the  following  experiments  we  increase  the  number  of  training  samples.  50  forms  are 
randomly  selected  for  training  and  the  remaining  50  forms  for  testing.  Figs.  15a  and 
15c  show  the  degraded  images  with  50%  and  90%  black  pixels  on  lines  flipped  to  white. 
As  shown  in  Fig.  15b,  the  line  detection  result  is  perfect  even  half  of  the  black  pixels 
are  flipped.  Fig.  15d  shows  that  the  horizontal  lines  are  still  detected  correctly  even 
when  90%  black  pixels  are  flipped,  but  the  vertical  lines  are  misdetected.  The  detection 
accuracy  vs.  degradation  on  the  test  set  is  shown  in  Fig.  16.  We  can  see  that  our  method 
is  very  robust.  It  maintains  good  results  with  accuracy  of  96.2%  even  when  80%  black 
pixels  of  lines  are  flipped. 

Our  method  is  fast.  The  average  processing  time  for  an  image  with  the  size  of 


Ratio  of  black  pixels  flipped 


Figure  16;  Line  detection  accuracy  vs.  ratio  of  black  pixels  flipped  on  lines. 
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1,700x1,800  pixels  is  about  0.4  seconds  on  a  PC  with  1.8GHZ  CPU  and  1GB  memory. 
Most  of  the  computation  is  spent  on  the  preprocessing,  such  as  skew  angle  estimation 
and  correction. 

7  Conclusion  and  Future  Work 

In  this  paper  we  present  a  novel  approach  to  detect  severely  broken  parallel  lines  in  doc¬ 
uments.  Our  method  is  based  on  a  stochastic  model  to  incorporate  high  level  constraints 
into  a  general  line  detection  algorithm.  Instead  of  detecting  lines  individually,  we  use 
the  Viterbi  algorithm  to  detect  all  parallel  lines  simultaneously.  Our  method  can  detect 
96.8%  severely  broken  rule  lines  in  the  Arabic  database  we  collected.  Some  challenging 
examples  demonstrated  the  robustness  of  our  approach.  For  known  form  processing,  it 
can  achieve  good  results  even  when  about  80%  black  pixels  on  lines  are  flipped  to  white 
during  degradation. 

Our  method  can  be  extended  for  form  identification  easily.  The  probability  of  observ¬ 
ing  a  sequence  of  observations  given  the  model,  which  is  useful  for  form  identification, 
can  be  calculated  with  the  forward  algorithm  efficiently.  We  are  currently  incorporating 
the  method  into  a  known  form  processing  system  for  both  form  identification  and  reg¬ 
istration.  Our  next  focus  is  to  develop  a  robust  line  removal  algorithm  to  enhance  the 
image  quality  of  documents. 
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